First things first: we need to look at the number of observations.
## [1] 1599 12
It turns out there are 1599 observations and 12 variables.
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
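Output like the above is what `dim()` and `str()` print for a data frame. Here is a minimal sketch, assuming the data sits in a data frame; `toy` is a hypothetical three-row stand-in built from values printed above, not the real `redwine` object.

```r
# Hypothetical three-row stand-in for the wine data
# (values copied from the first rows of the str() output above)
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8),
  alcohol       = c(9.4, 9.8, 9.8),
  quality       = c(5L, 5L, 5L)
)

dims <- dim(toy)  # c(number of rows, number of columns)
dims
str(toy)          # one line per column: name, type, first values
```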
The plots above provide some detail on the range and distribution of values in the 12 variables that we are dealing with.
The variable we will mainly focus on is ‘quality’. It appears to be recorded as discrete integers from 3 to 8.
There are:
For the sake of this project, an ‘outlier’ is defined as any point above (Mean + 2 × SD) or below (Mean - 2 × SD).
The boxplots, however, are stricter in classifying points as outliers. The outliers shown as red dots in the boxplots are meant only for visualization and should not be taken literally: when it comes to removing outliers, it is up to each individual’s discretion to decide which points cross the threshold of ‘remove because it is an outlier’.
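The Mean ± 2 × SD rule can be sketched as a small function. This is only an illustration under stated assumptions: `outlier_share` and the `toy` vector are hypothetical names and made-up values, not the code that produced the numbers in this report.

```r
# Hypothetical helper: proportion of values outside mean +/- k standard deviations
outlier_share <- function(x, k = 2) {
  x <- x[!is.na(x)]              # ignore missing values
  lower <- mean(x) - k * sd(x)
  upper <- mean(x) + k * sd(x)
  mean(x < lower | x > upper)    # share of points beyond either bound
}

# Made-up example: nineteen identical values and one extreme value
toy <- c(rep(10, 19), 100)
share <- outlier_share(toy)
share
```

The function returns a proportion between 0 and 1; multiply by 100 to read it as a percentage.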
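## [1] 0.00750469
‘fixed.acidity’ has an outlier proportion of 0.00750469, i.e. about 0.75% of observations
## [1] 0.006253909
‘volatile.acidity’ has an outlier proportion of 0.006253909, i.e. about 0.63% of observations
## [1] 0.0006253909
‘citric.acid’ has an outlier proportion of 0.0006253909, i.e. about 0.06% of observations
## [1] 0.01876173
‘residual.sugar’ has an outlier proportion of 0.01876173, i.e. about 1.88% of observations
## [1] 0.01938712
‘chlorides’ has an outlier proportion of 0.01938712, i.e. about 1.94% of observations
## [1] 0.0137586
‘free.sulfur.dioxide’ has an outlier proportion of 0.0137586, i.e. about 1.38% of observations
## [1] 0.009380863
‘total.sulfur.dioxide’ has an outlier proportion of 0.009380863, i.e. about 0.94% of observations
## [1] 0.01125704
‘density’ has an outlier proportion of 0.01125704, i.e. about 1.13% of observations
## [1] 0.005003127
‘pH’ has an outlier proportion of 0.005003127, i.e. about 0.50% of observations
## [1] 0.01688555
‘sulphates’ has an outlier proportion of 0.01688555, i.e. about 1.69% of observations
## [1] 0.005003127
‘alcohol’ has an outlier proportion of 0.005003127, i.e. about 0.50% of observations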
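Per-variable figures like the ones above could be computed and ranked in one pass rather than one variable at a time. A rough sketch, assuming a data frame of numeric columns; the helper `outlier_share` and the `toy` data frame (made-up values) are hypothetical.

```r
# Hypothetical helper: proportion of values outside mean +/- k standard deviations
outlier_share <- function(x, k = 2) {
  x <- x[!is.na(x)]
  mean(abs(x - mean(x)) > k * sd(x))
}

# Made-up data frame standing in for the numeric wine columns
toy <- data.frame(
  one_extreme = c(rep(1, 19), 50),
  two_extreme = c(rep(1, 18), 40, 60),
  no_extreme  = rep(c(1, 2), 10)
)

shares  <- sapply(toy, outlier_share)      # one proportion per column
ranked  <- sort(shares, decreasing = TRUE) # highest share first
percent <- round(100 * ranked, 2)          # proportions expressed as percentages
percent
```

`head(ranked, 5)` would then give the five columns with the largest share of outliers.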
There are 1599 observations of red wines and 12 variables.
‘quality’ is the main feature of the dataset. It is an ordered, categorical variable ranging from 3 to 8, 3 being the worst and 8 being the best.
The only way to know this would be to perform multivariate analysis and check the correlation coefficients. It is hard to infer in advance which of the other 11 numerical features would impact the main feature. Some may be positively correlated, some negatively, and some may have nothing to do with the main feature whatsoever. Before diving any further, one can surmise that the features that pertain to acidity (fixed.acidity, volatile.acidity and citric.acid) will play some role in the quality of red wine.
Looking at the dataset, there seems to be little value in creating new features. The three features that pertain to acidity (fixed.acidity, volatile.acidity and citric.acid) could be combined to form a new feature called total.acidity, but each of them plays a very specific role. It would be counterproductive to add a feature that is positively correlated with quality to one that is negatively correlated, since each of the acidity features exists as a separate entity. Other candidates for a new feature are total.sulfur.dioxide and free.sulfur.dioxide. Total SO2 is the bound SO2 plus the free SO2, so subtracting free from total would give the bound SO2. But then again, why create this feature separately when total.sulfur.dioxide is already a superset that adequately represents bound SO2?
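For completeness, the bound SO2 feature discussed above would be a one-line subtraction. This is only a sketch: the column name `bound.sulfur.dioxide` is hypothetical, and `toy` holds just the first three rows of the two SO2 columns as printed in the `str()` output.

```r
# First three rows of the two sulfur dioxide columns, copied from the str() output
toy <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15),
  total.sulfur.dioxide = c(34, 67, 54)
)

# Hypothetical derived column: bound SO2 = total SO2 - free SO2
toy$bound.sulfur.dioxide <- toy$total.sulfur.dioxide - toy$free.sulfur.dioxide
toy
```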
The top 5 features, from highest to lowest share of outliers, are chlorides, residual.sugar, sulphates, free.sulfur.dioxide and density.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
A self-explanatory summary of all the variables is shown above.
The variable of focus is going to be ‘quality’. This project will endeavour to find the correlations between other variables and our main variable ‘quality’.
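library(GGally)  # provides ggcorr()
library(plyr)    # provides mapvalues()
# Recode quality (3..8) as integer codes 1..6 so that ggcorr() treats it as numeric
redwine$quality <- as.numeric(factor(redwine$quality,
                                     levels = c("3", "4", "5", "6", "7", "8")))
ggcorr(redwine, label = TRUE, hjust = 0.9, size = 3, color = "grey50", angle = -45)
# Map the codes 1..6 back to the original ratings 3..8, then store quality as a factor
redwine$quality <- mapvalues(redwine$quality,
                             from = c("1", "2", "3", "4", "5", "6"),
                             to = c("3", "4", "5", "6", "7", "8"))
redwine$quality <- as.factor(redwine$quality)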
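The correlation matrix can also be read numerically rather than from the plot. A minimal sketch using base R’s `cor()`; `toy` is a hypothetical stand-in holding only the first ten rows printed by `str()`, so its coefficients are not representative of the full dataset.

```r
# First ten rows of four columns, copied from the str() output (not the full data)
toy <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

cor_matrix <- cor(toy)  # Pearson correlation between every pair of columns

# Keep only the correlations of the other variables with quality
others      <- setdiff(rownames(cor_matrix), "quality")
quality_cor <- cor_matrix[others, "quality"]
sort(quality_cor, decreasing = TRUE)
```

Run on the full data frame (with `quality` still numeric), the same two steps would list every variable’s correlation with quality from most positive to most negative.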
According to the correlation matrix in the multivariate section, some things about the variable ‘quality’ are apparent, and they are confirmed by the graphs in the bivariate sections.
The variable ‘quality’ has its strongest positive correlation with the variable ‘alcohol’.
According to the matrix, some things about the variable ‘quality’ are apparent
The main variable in focus (the quality rating out of 10) is stored as an integer, so it would be helpful to convert it to a factor.
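One way to do the conversion while keeping the ordering of the ratings is an ordered factor. This is a sketch of an alternative, not necessarily what this report does (the code here uses `as.factor`); `quality_int` is a hypothetical vector holding the first ten ratings printed by `str()`.

```r
# First ten quality ratings, copied from the str() output
quality_int <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor: levels 3..8 are kept even if some ratings do not occur
quality_fac <- factor(quality_int, levels = 3:8, ordered = TRUE)

levels(quality_fac)
quality_fac[1] < quality_fac[8]  # comparisons respect the rating order
```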
Since alcohol and sulphates are both positively correlated with quality, let’s plot them against each other and group them by quality.
A scatter plot is shown with a regression line for each quality category in the matching color, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.
It is clear from the above graphs that higher quality wines, rated ‘6’ and above, tend to have more sulphate content irrespective of alcohol content.
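The two kinds of plot used throughout this section can be sketched with ggplot2 as below. Several things here are assumptions: the data in `toy` is randomly generated rather than the wine data, the object names are hypothetical, and the straight-line fit (`method = "lm"`) is a guess at the smoothing used.

```r
library(ggplot2)

set.seed(1)  # reproducible made-up data
# Made-up values, not the wine data: 20 observations for each of three ratings
toy <- data.frame(
  alcohol   = runif(60, min = 8.5, max = 14),
  sulphates = runif(60, min = 0.4, max = 1.0),
  quality   = factor(rep(c("5", "6", "7"), each = 20))
)

# Scatter plot with one regression line per quality level, in the matching color
p_points <- ggplot(toy, aes(x = alcohol, y = sulphates, color = quality)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Same mapping with the points dropped, so only the per-quality lines remain
p_lines <- ggplot(toy, aes(x = alcohol, y = sulphates, color = quality)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Calling `print(p_points)` or `print(p_lines)` draws the plots. Swapping `sulphates` for another column gives the other pairings discussed in this section.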
Alcohol is positively correlated with quality, while density is negatively correlated with quality. Let’s plot them against each other and group them by quality.
A scatter plot is shown with a regression line for each quality category in the matching color, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.
One main distinction between red wines rated ‘3’ and those rated ‘8’ is the difference in their densities. High quality red wines have densities dropping below 0.995.
Residual sugar content has no correlation with quality, while alcohol content has a positive correlation with quality. Let’s plot them against each other and group them by quality.
A scatter plot is shown with a regression line for each quality category in the matching color, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.
It is very hard to distinguish quality levels based on alcohol content and residual sugar; there appears to be a lot of overlap.
Alcohol is positively correlated with quality, while pH is somewhat negatively correlated with quality. Let’s plot them against each other and group them by quality.
A scatter plot is shown with a regression line for each quality category in the matching color, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.
Red wines with quality ratings ‘6’, ‘7’ and ‘8’ are very close to each other when it comes to their pH values: the lines for those qualities follow the same path and end close to each other.
We already know that red wines with high concentrations of acetic acid are of poor quality, since they tend to have a vinegar taste. Let’s draw a graph that plots acetic acid against % of alcohol content, grouped by quality.
A scatter plot is shown with a regression line for each quality category in the matching color, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.
One distinction we can easily make here is that poor quality red wines rated ‘3’ have high amounts of acetic acid, as shown by their line. Higher quality red wines rated ‘6’, ‘7’ and ‘8’ have acetic acid concentrations around the same mark.
It is hard to infer the quality of red wine from tartaric acid concentration ALONE, because the mean and median tartaric acid concentrations fluctuate so much across quality levels. For red wines rated ‘7’, the mean and median tartaric acid concentrations coincide at about 8.8. Let us now analyse how the quality changes with % of alcohol content.
A scatter plot is shown with a regression line for each quality category in the matching color, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.
Good quality red wines rated ‘6’, ‘7’ and ‘8’ with alcohol content from 12.5% onwards have almost the same concentration of tartaric acid.
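The per-quality mean and median comparison mentioned above can be sketched in base R. This assumes that ‘tartaric acid’ refers to the `fixed.acidity` column; `toy` is a hypothetical stand-in holding only the first ten rows printed by `str()`, so the numbers are illustrative only.

```r
# First ten rows of two columns, copied from the str() output (not the full data)
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Split the acidity values by quality rating, then summarise each group
groups <- split(toy$fixed.acidity, toy$quality)
by_quality <- data.frame(
  quality = names(groups),
  mean    = unname(sapply(groups, mean)),
  median  = unname(sapply(groups, median))
)
by_quality
```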
As the wine quality increases, citric acid concentration increases. As stated in the data dictionary, citric acid can add ‘freshness’ and flavor to wines, so a higher concentration of the acid adds more freshness. Let’s see how it varies with alcohol content.
A scatter plot is shown with a regression line for each quality category in the matching color, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.
Among the best quality red wines rated ‘6’, ‘7’ and ‘8’, the citric acid concentration falls as the alcohol content increases. However, when viewing citric acid alone against quality, the better quality red wines simply had higher citric acid concentrations. Among better wines, the increase in alcohol concentration offsets the decrease in citric acid concentration.
A scatter plot is shown with a regression line for each quality category in the matching color, to depict the separation. To make the plot clearer, a geom_smooth plot has been added below it.
The salt content is almost the same in red wines rated ‘7’ and ‘8’. Otherwise it is hard to come to meaningful conclusions based on salt content alone.
Among the best quality red wines rated ‘6’, ‘7’ and ‘8’, the citric acid concentration falls as the alcohol content increases. However, when viewing citric acid alone against quality, the better quality red wines simply had higher citric acid concentrations. Among better wines, the increase in alcohol concentration offsets the decrease in citric acid concentration.
One distinction we can easily make is that poor quality red wines rated ‘3’ have high amounts of acetic acid, as shown by their line. Higher quality red wines rated ‘6’, ‘7’ and ‘8’ have acetic acid concentrations that track their respective alcohol content.
It is very hard to distinguish quality levels based on alcohol content and residual sugar; there appears to be a lot of overlap. There appears to be no correlation between residual sugar and the quality of red wine whatsoever.
It is clear from the graphs that higher quality wines, rated ‘6’ and above, tend to have more sulphate content irrespective of alcohol content. This may be part of what makes great quality red wines great.
There are many correlated variables, but the multivariate analysis shows one thing that clearly stands out in poor quality red wines. Poor quality red wines rated ‘3’ on a scale of 10 tend to have higher concentrations of acetic acid. At too high a level, acetic acid can lead to an unpleasant, vinegar taste.
There are many correlated variables, but the multivariate analysis shows one thing that clearly stands out in great quality red wines. Great quality red wines rated ‘7’ and ‘8’ on a scale of 10 tend to have higher concentrations of potassium sulphate irrespective of the alcohol concentration.
The quality of red wine varies with no clear pattern as the amount of residual sugar changes, which suggests little to no correlation between the two variables. The boxplot helps show the variance within each category.
The differences in the number of observations for each class made it difficult to analyse quality. There were no obvious or perfect correlations, and this imbalance is probably the reason. For example, there are 10 observations rated 3 out of 10 but 681 observations rated 5 out of 10. This makes it hard to characterise poor wines with any certainty, since there are very few examples of them.
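The class imbalance described above can be checked with a frequency table. A minimal sketch; `quality` here is a hypothetical vector holding only the first ten ratings printed by `str()`, so the counts are illustrative and not the full-data counts.

```r
# First ten quality ratings, copied from the str() output (not the full data)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Fixing the levels at 3..8 keeps ratings with zero observations visible
counts <- table(factor(quality, levels = 3:8))
counts

# Share of each rating, as a percentage of all observations
round(100 * prop.table(counts), 1)
```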
Python offers more aesthetically pleasing plots, graphs and charts through the seaborn package. I could not find an equivalent in R (or maybe I haven’t discovered it yet).
Using geom_smooth instead of geom_point made a ton of difference in articulating useful observations in the Multivariate Analysis section (section 4 and its subsections).
Sometimes multiple legends depicting the same thing popped up in graphs; using the right parameter made a ton of difference.
There were so many ways to visualize data, that merely picking the right type of visualization was a challenge.
Although ordered categorical variables should be displayed with a graded color palette, this makes it hard to see the differences between categories, since lightly shaded lines and darker lines of the same color overlap so much.
At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). It would help to know how each wine expert rated them, as that would tell us more about the experts themselves. One expert might have given higher ratings to wines with more alcohol content. Another might have rated a red wine with a higher acetic acid concentration more highly.
It would also help to know which varieties of grapes were used and in what proportion, since grapes are the core ingredient of red wine.